Blockwise float8 quantizer and quantized tensor class #1513
+4,144 −74
Description
Adds PyTorch and C++ quantizer and quantized tensor classes for a subchannel quantization scheme.
The classes can be configured for a 128x128 or 1x128 block size by setting block_scaling_dim to 2 or 1, respectively.
Scale tensors are stored in a format amenable to matrix multiplication; matmul integration itself is deferred to a separate story.
Fusions of quantization with DBIAS or activation functions are not yet implemented, and dequantization is currently implemented in torch.
Tests for quantization are included at the C++ and PyTorch layers, with exact comparisons against a reference quantizer as well as coverage of interesting branches through the API, such as tensor creation in PyTorch and C++ and dequantization of rowwise and columnwise usages.
Two CUDA kernels for quantization are included.
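For orientation, below is a minimal pure-PyTorch sketch of the subchannel scheme described above, using torch.float8_e4m3fn. The function names, the choice of e4m3, and the scale layout are illustrative assumptions only; they do not reflect this PR's actual classes, scale-storage format, or CUDA kernels.

```python
import torch

FP8_E4M3_MAX = 448.0  # largest magnitude representable by torch.float8_e4m3fn


def blockwise_quantize(x, block_rows=1, block_cols=128):
    # One scale per (block_rows x block_cols) tile:
    #   block_rows=128 mimics the 128x128 (block_scaling_dim == 2) configuration,
    #   block_rows=1   mimics the 1x128   (block_scaling_dim == 1) configuration.
    R, C = x.shape
    assert R % block_rows == 0 and C % block_cols == 0
    tiles = x.view(R // block_rows, block_rows, C // block_cols, block_cols)
    amax = tiles.abs().amax(dim=(1, 3), keepdim=True).clamp_min(1e-12)
    scale_inv = amax / FP8_E4M3_MAX                  # per-tile dequantization scale
    q = (tiles / scale_inv).to(torch.float8_e4m3fn)  # encode with per-tile scaling
    return q.view(R, C), scale_inv.squeeze(1).squeeze(-1)


def blockwise_dequantize(q, scale_inv, block_rows=1, block_cols=128):
    # Plain-torch dequantization, analogous to the torch dequant path mentioned above.
    R, C = q.shape
    tiles = q.view(R // block_rows, block_rows, C // block_cols, block_cols).float()
    return (tiles * scale_inv[:, None, :, None]).view(R, C)
```

A round trip such as blockwise_dequantize(*blockwise_quantize(x)) recovers x up to per-block quantization error; the tests described above instead compare the kernel output exactly against a reference quantizer.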
Type of change
Changes
Please list the changes introduced in this PR:
Checklist items that can arguably be deferred to a future MR:
Tasks that have a dependency on a GEMM and are therefore not included:
Test Instructions
Python tests:
C++ tests:
Checklist: